Understanding and Improving the Wang-Landau Algorithm 
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We present a mathematical analysis of the Wang-Landau algorithm, prove its convergence, identify 
sources of errors and strategies for optimization. In particular, we found the histogram increases 
uniformly with small fluctuation after a stage of initial accumulation, and the statistical error is 
found to scale as ^In / with the modification factor /. This has implications for strategies for 
obtaining fast convergence. 
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The Wang-Landau(WL) algorithm jli[has been applied 
to a number of interesting problems [l|, S H S M ■ 
It overcomes some difficulties in other Monte Carlo al- 
gorithms such as critical slowing down, and long relax- 
ation times due to frustration and complex energy ter- 
rain. Similar to the Metropolis algorithm, it is a generic 
algorithm, independent on the details of the physical sys- 
tem. Many methods have been suggested to iniprove the 
algorithm for certain types of systems 0, 0, 0|. The 
same mechanism also appears in the recent research of 
molecular dynamics simulations TTl. Among the stud- 
ies to characterize and improve the efficiency of the algo- 
rithm, Dayal et al.^^ shows the WL algorithm consider- 
ably reduces the tunneling time, and Trebst et al.^ pro- 
posed an algorithm that performs better in terms of tun- 
neling time. However, the WL algorithm has been used 
as an empirical method. Many important questions still 
remain unanswered: (i) How is flatness of the histogram 
related to the accuracy; (ii) what is the relation between 
the modification factor and error; and (iii) how does the 
simulation actually find out the density of states? The 
convergence of the WL algorithm should be guaranteed 
by a generic principle, in the same sense as the detailed 
balance assures the convergence of the Metropolis algo- 
rithm. However, the WL algorithm is different from the 
Metropolis algorithm, since it is not a Markov process. 

In this paper we present our study of this algorithm 
from an analytical approach, and try to answer those 
questions raised above. Our analysis provides a proof of 
the convergence of the method, estimation for the errors 
and the computational time, along with some strategies 
for optimization and parallelization. 

The goal of the WL algorithm is to accumulate knowl- 
edge about p{E) during a Metropolis-type MC sampling. 
The Metropolis-type random walk is characterized by an 
acceptance ratio min{l, g{Ej)/g(Ei)}, where g{E) is a 
function of energy, similar to the Boltzman factor in the 
usual Metropolis algorithm. Ei and Ej refer to energies 



* Current address: Computer Science and Mathematics Division, 
Oak Ridge National Laboratory, P.O. Box 2008, MS6164, Oak 
Ridge TN, 37830-6164, U.S.A., Center for Simulational Physics, 
University of Georgia, Athens, Georgia 30602, U.S.A. 



before and after this transition, the acceptance ratio bi- 
ases the free random walk and produces a final histogram 
h(E), which is related to the equilibrium distribution of 
the unbiased random walk p{E) by p{E)g{E) — h{E), 
provided that both sides of the identity are normalized. 
This identity is essentially a result of detailed balance. 
The WL algorithm divides g{Ej) by a modification fac- 
tor / after each transition, expecting g{E) to converge 
to 1/ p{E) and the histogram h{E) to be fiat. 

p{E) is a priori unknown in the simulation. We begin 
our analysis by clarifying the relevant parameters. Sup- 
pose the phase space of our physical model is divided 
into N macroscopic states with density (number of mi- 
croscopic configurations) pi > for macroscopic state 
i {i = 1,2 •••iV). (These macroscopic states could be 
labeled by energy, magnetization, or other macroscopic 
variables). Each microscopic configuration in the phase 
space uniquely belongs to one macroscopic state. The 
histogram hi{t) with 1 < i < is defined as the number 
of visits of each macroscopic state before t^^ step of the 
simulation. Initially ^.^(1) = 0, after macroscopic state k 
is visited at time t, hjjt -I- 1) = hi{t) + Sik- In the orig- 
inal implementation [j], one record is inserted into the 
histogram every Metropolis trial flip. However, in addi- 
tion to the modification factor, we have a second tunable 
parameter of the algorithm, separation S between suc- 
cessive records in the histogram, defined as the number 
of trial flips (random steps) that precede each increment 
of the histogram. S steps of random walk should be 
regarded as a single transition from initial macroscopic 
state i to the final macroscopic state j. Namely, the 
transition from i to j includes S trial fiips, each trial 
flip makes a transition from macroscopic state kn-i to 
macroscopic state kn (/cq = i and ks — j) with accep- 
tance ratio min{l, exp[ln /(/ife^ j^ (t) — ^fe„(^))]}- At time 
t, macroscopic state k has a probability (i) to be picked 
out by the simulation, pk (t) is normalized so that vector 

p{t) e Vn, where Vn ^ {x e [0, 1]^, E^^i a^fe = 1} is 
an N-dimensional simplex. In the following derivation, 
we assume S is larger than the autocorrelation length 
of the random walk, then i and j can be considered as 
independent random macroscopic states. The effect of 
autocorrelation will be discussed later. Although this as- 
sumption differs from real simulations, it is reasonable 
since the particular model under study is not specified. 
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It is also asymptotically accurate for any model when S 
is large enough. With this assumption, the probability 
distribution p{t) has an explicit expression, 

Mt) = z{t)-H,P^/o^)f-'^^('\ (1) 

where Z(t) = is the normalization 

constant, and without loss of generality, we insert an ini- 
tial guess 9i into the simulation. In fact serves as a 
guess for the simulation after t. Similarly the simulation 
can start from some "existing knowledge" represented by 
9i. If nothing is known about the density of states, then 
we start with 6i = 1. With the probability distribution 
Eq. , the macroscopic state fc has the probability (t) 
to be picked out in the next step. Once this happens, p(t) 
is changed to ■p{t + 1) by 

V^it + 1) - V^{t)r'''|{^ - Pk{t) + Pk{t)r\ (2) 

Here the Kronecker-(5 only suppresses pk (t) by a factor of 
/ and the denominator comes from the change in Z{t + 1) 
such that the normalization of probability is preserved. 
We point out that the evolution of p{t) is a Markov pro- 
cess, although the WL algorithm is not, because it makes 
references to its entire history. 

We will prove that the p{t) is attracted to the vicinity 
of imiform distribution (pi"-* = 1/-^) in the simulation. 
For this purpose, we define a measure of the difference 
between p(t) and the miiform distribution p^^"^ by 

N 

fi{t) = N\nN + ^\np,{t). (3) 

i=l 

One can check that < and ^{t) — only when 
p{t) = p'^^K After the macroscopic state k is picked out, 
Afi{t) = ^{t + 1) — fJ-{t) is given by 

A^l{t)=-lnf-Nhl[l-pk(t)+pk{t)r']■ (4) 

Obviously Afi{t) > —In/, and when pk{t) > iLf-i ~ 
ln//7V(l-/-i) (approximately 1/N when / 1), Afi{t) 
is always positive. This shows that increases, when 
the simulation picks out macroscopic states with prob- 
abilities above average. However there is a probability 
for fi{t) to decrease, in particular, at the center of at- 
traction {p{t) = p'-"-'), A/i(t) is negative. As a result, 
rather than converging to the uniform distribution, p{t) 
is expected to either fluctuate around uniform distribu- 
tion, or go away from it. Actually the second situation 
does not happen. To prove this, we first show that the 
expectation value Ep(t){A^(i)} (averaged over all possi- 
ble moves) has a lower bound determined by p{t) : Using 
Eq. |@J, the expectation value becomes 

^ 1 



Since < Pk{t){l — f~^) < 1, we use the inequality 
ln(l — x)^^ > X for X E (0, 1) to give a lower bound for 
the logarithm, which turns out to be 

N 

Ep(,){A/i(t)} >-\nf + N{l- r^)Y,pl{t). 

k=0 

Typically pfe (i) is of order l/N, where is a large integer, 
so this lower bound is very close to the actual value. This 
lower bound can be further expressed in terms of the 
Euclidean distance between p{t) and p'--^\ since \\p{t) — 
= ||p(t)|p — 1/A^, due to the normahzation of p{t). 
Therefore we have: 

Theorem 1 If \\p{t) ~ p^"^\^ ^ N-^[{1 - f-^)-^ In f - 
1] -|- e, with II • • • II being the Euclidean distance, the ex- 
pectation value of A/i(t) averaged over N possible moves 
is bounded from below Ep(j-){A/i(t)} > A^(l — /^^)e. 

Theorem Q] states that for a probability distribution 
p{t) outside the N-dimensional sphere i?e defined by its 
condition, fi{t) always has a tendency to increase. Next 
we consider an ensemble of simulations, whose p{t) has 
a certain distribution at time t, Ft{p). The ensem- 
ble averaged /x(t) is defined as (/x(t)) — J Ft{p)fi{p)dp. 
(note in the integrand we treat /i as a function of p 
instead of time t.) We want to show that the evolu- 
tion of Ft{p) brings every simulation in the ensemble 
into the sphere B^. Define D{p,p') as the probability 
of bringing distribution p' to p after one step. Obviously, 
Ft+i{p) ~ Jy^ F){p,p')Ft{p')dp' , where the integral over 
p' is restricted to the simplex Vat. 

We can express the ensemble average of fj.{t + 1), 
(flit + 1)> = J^^ Ft+iip)fi{p)dp, with Ftip): {fi{t + 1)) = 
ffy^ fi{p)D{p,p')Ft{p')dp'dp. As a result of Theorem 
^ if we assume at time t, every simulation is outside 
B„ i.e. p' ^ B„ then Jy^ fi{p)D{p,p')dp > n{p') + 

N{1 ~ /-i)e. Therefore (m(< + 1)) > Jyjl^ip') + Nil - 
f-^)e]Ft{p')dp' = {^{t)) + N{1 - /-i)e, which is the fol- 
lowing corollary: 

Corollary 2 If the distribution Ft does not enter B^, i.e. 
supp(Ft)nBe = 0, then (^(t + 1)) > {tiit))+N{l- f-^)e. 

This result implies that (/i(i)) increases at least linearly 
as the simulation goes. However, if supp(i^t) is always 
outside /i(i) is therefore bounded from above by the 
maximum value of ^{p) on the boundary of B^. This 
contradiction tells us that part of the ensemble has to 
be pushed into B^ so that supp(Ft) H B^ ^ 0. We can 
exclude those parts of the ensemble already inside B^, 
and apply the same inference to the remaining part of 
the ensemble which is still outside B^. The conclusion 
is, no matter where the simulation starts, p{t) sooner or 
later goes into B^. Once p{t) is in the vicinity of p^°\ 
it is unlikely to escape because A/Lt(t) has a lower bound 
A/i(i) > — In/. If after a certain step p{t) moves outside 
-Be, we can immediately use Corollary |21 to show that 
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it is attracted back into B^. It is this attraction towards 
p^°^ that reduces the tunnehng time of the WL algorithm 
[Tl |. When 7V(1 - /-I) < 1, this attraction is weak, 
which explains why Ref. finds the tunneling time of 
the WL algorithm saturates when / is less than a critical 
value determined by the system size. 

When p{t) is trapped near p^^\ the histogram shows a 
uniform growth with fluctuation. hi{t) = \ogf{pi/6i) + 
t/N + Tiit), where ri{t) is a random number with zero 
mean. The approximate density of states is measured 
from the histogram by = K9if'^^^^\ where K is 

a proper normalization constant. Figure ^ shows three 
snapshots of the histograms calculating the density of 
states of a two-dimensional Ising model on a lattice of 
size 32 by 32 with periodic boundary conditions. We 
have used / = in this simulation to reveal the fluctu- 
ation of the histogram in Fig. ^ The simulation vis- 
its energies (macroscopic states) with high density of 
states first, then extends the histogram to the whole 
spectrum. Once the whole spectrum is visited, the his- 
togram grows uniformly with small fluctuation. Two im- 
portant observations follow the results above: (i) A flat 
histogram is not required for convergence; the histogram 
is ready for calculation of p' when it reaches a threshold 
hi{t) ^ (Tri, for all i's, where denotes the standard 
deviation of ri(t). (ii) The statistical error can be re- 
duced by averaging over multiple results obtained with 
the same /, as well as reducing /. A proper estimation 
of the statistical error is from the condition of Theorem 
n Since p{t) fluctuates around a ball centered at p'"' 
of radius N''^/^TT{f) = ^/N-^[{1 - /-i)-iln/ - 1], we 
use 7r(/) to give an reasonable estimation of the statisti- 
cal error. p{t) is related to p'{t) by pi{t) — Cpip[^^{t), 
where C is a constant for all i's. Plugging this into 
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FIG. 1: Snapshots of the histogram of a single random walker 
on a two-dimensional 32 by 32 Ising model. Three thin curves 
are histograms of three sequences of lengths labeled in the 
figure. The thick dark line is the p(E) calculated from the 
last histogram (t = 76950), which overlaps with the exact 
p{E) within the accuracy of the width of the line on this 
figure. The length of the sequence is just the area under the 
histogram. The dashed staircase indicates a possible guess for 
p{E) with four energy intervals (see text). 
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where NC is a constant allowed by the WL algorithm, 
which can be absorbed into p'lit). The left side of this 
equation is the standard deviation of pij p\(€). So an ap- 
propriate estimation of the typical relative error Sp'^/p^ 
is 7r(/), which scales as VhT/ when / is close to 1. (The 
difference between the expression here {p' — p)/ p' and the 
standard definition of relative error (p' — p)/ p is negligi- 
ble, when the error is small.) Thus, we expect the fluctu- 
ation in the histogram to be ~ 7r(/)/ln/ ~ l/^ln /, 
because dp[{t) / dhi{t) — p[{t) In/. This has been recently 
confirmed by numerical tests 13]. Our strategy for a sin- 
gle iteration simulation is to run until a minimum num- 
ber of visits (at least l/-\/ln/) have been accumulated for 
each macroscopic state, followed by measurements sepa- 
rated by a short simulation which decorrelates (t) . Usu- 
ally 1 / \/In / visits on each macroscopic state is enough. 
With K measurements, the statistical error in \iip^{t) is 
reduced to ^J\nf/K. The total number of records in the 
histograms is thus at least: 



N 



NK 



(6) 



i=l 



The first term represents the number of records for the 
simulation to reach every macroscopic state. This term 
occupies the bulk of the histograms in Fig. The sec- 
ond term of Eq. © represents the cost of K uncorrelated 
measurements. The measurements can be parallelized on 
a number of processors. The dashed staircase in Fig. ^ 
shows what happens if the energy is divided into 4 equal 
intervals, as Ref. did for parallelization. As the spec- 
trum is divided into four intervals assigned to four sepa- 
rate simulations, an interval with high density of states 
does not have to wait for those with low density of states 
to be visited. The histogram represented by the first term 
of Eq. jnj is reduced to the area of 4 triangles bounded 
by the staircase and the last histogram above it in Fig.^ 
In terms of saving total computational time, an equiva- 
lent strategy is to use the staircase as an initial guess Oi . 
Thus, four triangles are filled simultaneously, equivalent 
to doing 4 simulations sequentially. Dividing the energy 
range causes boundary errors while a good initial 
guess of the functional form of the histogram does not 
have this problem. 

Assuming N is roughly proportional to the total num- 
ber of degrees of freedom, to the logarithm of maximum 
density of states, and to the number of MC steps to gen- 
erate an uncorrelated visit, the cost of CPU time for the 
first term in Eq. Q is of order 0{N^), while the cost 
of the second term is of order 0{N'^). (Logarithmic cor- 
rections might be present.) If we use a proper guess 9i 
to begin the simulation, the CPU time cost for the first 
term can be substantially reduced to O(iV^). 
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FIG. 2: p{M), density of states for magnetization M of 16 
Ising spins normalized to p{M = 16). The solid line connects 
the exact values, symbols were obtained with different param- 
eters / and S as explained in the text. Data was averaged 
over 100 measurements, so the statistical errors are smaller 
than the symbols. 

Now we discuss the effect of insufficient S which in- 
troduces autocorrelation between successive records in 
the histogram. At first sight, the total number of steps 
given by Eq. © is considerably reduced by using a large 
/. Multiple measurements also reduces the statistical 
error more quickly than reducing the value of /. So a 
small / seems to be unnecessary. However, there are 
systematic errors due to the correlation between adja- 
cent records in the histogram when / is not small, or 
the separation S not large enough. We illustrate this 
systematic error in Fig. |21 which shows the total den- 
sity of states for a fixed magnetization M, p{M), of the 
Ising model on a 4 x 4 lattice, for which the exact re- 
sult is: logio/o(M) = logio Cfg^^^^^ + (1 - <5oM)logio2. 
States with M and —M are grouped in the same macro- 
scopic state for a better statistics, so M is restricted to 
be a non-negative even integer. We demonstrate the ef- 
fect of this correlation in Fig. |2] by showing the result 
of an extremely biased scheme that restores all 16 spins 
to total alignment after each record. As expected, the 
result (shown as downward triangles) is biased towards 
M = 16. The simulation actually calculates the proba- 



bility of reaching state M starting from M = 16 within 
16 trial flips, which deviates from the correct density of 
states. On the other hand, The accuracy of data shown 
as squares indicates that the desired probability distri- 
bution Eq. 1^ is produced after 1600 trial flips. The 
data plotted with diamonds for a smaller / show that 
the systematic error is also reduced by letting / ^ 1, or 
equivalently, the minimum S required to eliminate the 
autocorrelation decreases with /. As an extreme case, if 
each trial flip is recorded {S = 1), for the Ising model of 
Fig. n] the histogram always grows quickly near E = 
and propagates to higher or lower energies very slowly. 
The systematic error due to the correlation is revealed, 
when the statistical error is reduced by multiple measure- 
ments with a single /. At this point, either a smaller /, 
or larger S is necessary to improve the accuracy. 

To summarize, we have given an proof of the conver- 
gence of the WL algorithm, and analyzed the sources of 
errors and optimization strategies. We find: (i) The den- 
sity of states is encoded in the average histogram; (ii) The 
fluctuation of the histogram, proportional to /, 
where / is the modiflcation factor, causes statistical er- 
ror, which can be reduced by averaging over multiple 
p'{t). (iii) The correlation between adjacent records in 
the histogram introduces a systematic error, which is re- 
duced by small /, and also by minimizing the correlation, 
e.g. using large S, or cluster algorithms J^J. These flnd- 
ings suggest that numerical simulations can start with a 
large /, e.g. e'*, and then reduce / in large steps in each 
stage, e.g. divide In / by a factor of 10. Multiple mea- 
surements can be made in the final stage to reduce the 
statistical error effectively. However, results calculated 
with a single pair of / and S is prone to systematic er- 
ror. One can extrapolate results calculated with different 
/ and 5' to / = 1 or 5 = oo to eliminate this error. If the 
error is believed to be small enough, one can also reduce 
/ to 1 directly, which results in a histogram proportional 
to pi/p'^{t), which is the ratio of the true density of states 
to the numerical results. 
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